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Efficient automatic protein classification is of central importance 
in genomic annotation. As an independent way to check the reliabil- 
ity of the classification, we propose a statistical approach to test if 
two sets of protein domain sequences coming from two families of the 
Pfam database are significantly different. We model protein sequences 
as realizations of Variable Length Markov Chains (VLMC) and we 
use the context trees as a signature of each protein family. Our ap- 
proach is based on a Kolmogorov-Smirnov-type goodness-of-fit test 
proposed by Balding et al. [Limit theorems for sequences of random 
trees (2008), DOI: 10.1007/sll749-008-0092-z]. The test statistic is a 
supremum over the space of trees of a function of the two samples; its 
computation grows, in principle, exponentially fast with the maximal 
number of nodes of the potential trees. We show how to transform 
this problem into a max-flow over a related graph which can be solved 
using a Ford-Fulkerson algorithm in polynomial time on that num- 
ber. We apply the test to 10 randomly chosen protein domain families 
from the seed of Pfam-A database (high quality, manually curated 
families) . The test shows that the distributions of context trees com- 
ing from different families are significantly different. We emphasize 
that this is a novel mathematical approach to validate the automatic 
clustering of sequences in any context. We also study the performance 
of the test via simulations on Galton-Watson related processes. 
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1. Introduction. The primary structure of a protein is represented by a 
sequence of 20 different symbols called amino acids. Proteins can be com- 
posed of one or more functional regions, called domains; the identification 
of domains that occur within a protein can provide insights into its func- 
tion. For this reason biologists classify protein domains into families and 
care about the reliability of the classification [Stein (2001)]. But in a protein 
domain database not only the quality of the classification is important, the 
number of proteins encoded by the genomes that are assigned to the fami- 
lies is also important. This is usually referred to as proteome coverage. For 
this reason, usually in most databases there must be some balance between 
quality and quantity. 

The Pfam database is a large collection of protein domain families [Finn 
et al. (2006)]. In its last release of July 2007, the Pfam database comprises 
9318 annotated families (Pfam-A) as well as a lower quality, unannotated 
collection (Pfam-B). Each Pfam-A family consists of two parts: a manually 
curated set of protein domains called seed and a set of automatically detected 
protein domains using a profile hidden Markov model (profile HMM), whose 
parameters are estimated from the seed of the family. 

To our knowledge, no independent method to validate the Pfam classifi- 
cation has been proposed, in spite of problems that the uncertainty in the 
alignment of sequences can lead to [Wong, Suchard and Huelsenbeck (2008)]. 
We make a step in this direction by presenting a statistical method to test if 
two samples from protein domains come from two different families. If some 
families are not significantly different, then the problem of classifying new 
proteins becomes risky. 

We start by modeling protein sequences as Variable Length Markov Chains 
(VLMC), a model introduced by Rissanen (1983). A VLMC is a discrete 
time stochastic process with the property that the law of the process at any 
given time depends on a finite (but not of fixed length) portion of the process 
at precedent times [Biihlmann and Wyner (1999)]. As usual in the applica- 
tions of VLMC, we assume that the process is a Markov chain of order 
at most L (finite memory process). The minimum set of sequences needed 
to completely specify the distribution of the next symbol in the sequence 
is known as a context tree and it is denoted by t. Calling p the conditional 
transition probabilities associated to the nodes oft, the pair {t,p) completely 
determines the law of the VLMC. 

VLMC have been successfully applied to model and classify protein se- 
quences [Bejerano and Yona (2001)]. As in the case of profile HMM in 
the construction of the Pfam families, the VLMC approach of Bejerano 
and Yona takes, for each family, a set of already classified protein domains 
and estimates a VLMC model, that is, a pair {t,p). Then, the estimated 
VLMC model is used to classify other protein sequences into the family. 
Instead, we treat the context trees of sequences of a given family as random 
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samples of a distribution associated to the family; this distribution is used 
as a signature of the family [Galves et al. (2004), Leonardi et al. (2007)]. 
That is, we propose that the context trees of the sequences disregarding 
the associated probabilities are sufficient to test if two samples of sequences 
come from different Pfam families. We take two samples of protein sequences 
of size n and m respectively and for each sequence we construct the esti- 
mated context tree using the PST algorithm introduced by Ron, Singer and 
Tishby (1996) and implemented by Bejerano (2004), obtaining two samples 
of trees t = (ti, . . . ,t„), t' = {t'l, . . . We assume that the samples are 
independent and that the trees in each sample are independent and identi- 
cally distributed with laws tt and vr', respectively. We test Hq : vr = tt' against 
Ha : tt 7^ vr' using the test proposed in Balding et al. (2008) (in what follows 
we will denote it by BFFS test). Rejection of the null hypothesis leads us 
to conclude that the protein families are distinct. 

The BFFS test is a Kolmogorov-Smirnov-type goodness-of-fit test. A 
distance d defined later in (8) is considered in the space of trees T and the 
statistic for the two-sample test is given by 

(1) l^(t,t') :=sup|d(t,t) -(i(t,t')|, 

where d{t,t) = ^ Y^^=i d{t,ti); that is, W{t,t') is the supremum over t in the 
space of trees T of the difference of the empiric mean distances of t to each 
of the two samples t and t'. The null hypothesis is rejected for large values of 
W{t,t'). Since the law of W under Hq is not explicitly known, a simulation 
procedure is performed to find the p- values. 

The computation of the test statistic VF(t,t') is a priori difficult; a naive 
search would involve an exponential complexity of the algorithm on the 
number of potential nodes. A major point of this paper is to show that 
the problem can be re-expressed as to find the maximal fiow on a graph 
constructed as a function of the samples. The approach is inspired by the 
search for the Maximum a Posteriori in Bayesian image reconstruction using 
the Ising model, as proposed by Greig, Porteous and Seheult (1989) [see also 
Kolmogorov and Zabih (2004)], but requires the introduction of a penalty to 
guarantee that the solution is in T. The max-flow problem can be solved in 
polynomial time on the maximal number of nodes of the tree, using Ford- 
Fulkerson type algorithms. 

Statistical analysis of tree-like data has been performed in several pa- 
pers. Banks and Constantine (1998) obtain trees by hierarchical clustering 
of authors of written texts, using search-related features. They assume a 
parametric model and use a metric in the space of trees to get a center 
point and a confidence band around it. Computation of the distribution's 
parameters, center point and spread are feasible when a distance of the same 
type as in the BFFS approach is used. Wang and Marron (2007) analyze 
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a sample of blood vessels in the human brain, represented by trees. Each 
node represents a blood vessel, and the edges represent vessel connections. 
The statistical analysis of this data was based on a Frechet approach, which 
in turn is based on a metric. The Frechet mean of a data set is the point 
which minimizes the sum of the squared distances to the data points. In 
the specific application to blood vessels, both the structure of the trees (i.e., 
connectivity properties) and the attributes of the nodes (such as the loca- 
tions and orientations of the blood vessels) were considered. This is a major 
difference with the BFFS approach, where only the structure of the trees 
enters in the test statistics. 

In Section 2 we define trees, describe VLMC and explain how to obtain 
the context trees from the observed protein domain sequences. In Section 
3 we define the distance in the tree space and describe the BFFS test. In 
Section 4 we develop the algorithm to compute the BFFS test statistics. In 
Section 5 we describe the one-sample test and discuss possible extensions of 
the approach. In Section 6.1 we perform pairwise comparisons of samples of 
trees corresponding to 10 Pfam families. Final remarks are in Section 7 and 
computing notes in Section 8. At the end of Section 3 and then in Appendix 
A.l, we discuss problems related to the power of the tests. Appendix A. 2 
contains the proofs of selected results. In Appendix A. 3 we perform the test 
on several samples of Galton-Watson related trees obtained with Monte 
Carlo simulation. 

2. Protein related random trees. A protein sequence can be modeled as 
a realization of a discrete time stochastic process having as state space the set 
A of 20 amino acids. This is the basic idea in the modeling of protein domains 
by HMMs or VLMCs. In this section we introduce the basic concepts behind 
VLMC and show how the context tree associated to a protein sequence can 
be estimated using the Probabilistic Suffix Tree (PST) algorithm proposed 
by Ron, Singer and Tishby (1996) and implemented by Bejerano (2004). 

Let ^ be a finite alphabet and V = lJfc=o the set of sequences of symbols 
in A. Denote the sequence a^a^+i • • • a^. Given a sequence a{, any sequence 
a;^ with 1 < i < j is called a suffix of a{. Let T := {t C V : ai G t implies 

€ t} be the space of rooted trees with nodes in V; the empty sequence is 
the root of the tree and it is called A. The edges of t are {(a-{,a2) :a{ € t}. 
A node of an edge is a suffix of the other node and the difference in length 
of the two nodes is one. Hence, the tree t is identified with its set of nodes. 

Let X = {Xn)nez be a stationary stochastic process taking values in A. 
Define 

p{a\aZ]) := P[Xo = a\Xzj = al]]. 
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A finite sequences € F is sufficient to determine the law of the next 
symbol if 

(2) p{-\aZ''f^aZl) = pi-\aZl) for all k<j and all aZ'^f^ G 

where al^~^al^ denotes the concatenation of the sequences and a~^. 

We assume that the process is a Markov chain of order L, that is, that (2) 
holds for k = L and all aZ\ G A^- A finite sequence aZl. G is called a 
context if it satisfies (2) and 

(3) p{-\aZl)^p{-\azU,). 

We say that X is a VLMC if there are contexts of length less than L, that 
is, if there exists a k < L and a sequence aZ]. satisfying (2) and (3). The set 
t of contexts and all their suffixes is called the context tree; each node in 
the context tree is labeled by a finite sequence over A. In this finite memory 
case, a VLMC is simply a parsimonious representation of a Markov chain 
of order L that, in its strict sense, would have (|^| — 1)|^|^ parameters (a 
probability distribution associated to each sequence of fixed length L). 

Under the assumption of bounded memory, the pair composed by the 
context tree and the set of transition probability distributions associated to 
the nodes of t completely specify the law of the stationary process X. 

Figure 1 summarizes an example of a context tree and transition probabil- 
ities for a stationary process X over the alphabet A = {1,2}. For < a < 1 
the transition probabilities are given by 

(a, if al^ = 111 or al;^ = 122, 

(4) p{l\aZi,) = < 1 - a, if al;^ = 211 or al;^ = 222, 

1 0.5, otherwise; 

see Figure 1(a). If a / 0.5, the set of contexts is {111,122,211,222} and 
the context tree is t = {A, 1, 2, 11, 22, 111, 211, 122, 222}; see Figure 1(b). If 
a = 0.5, the context tree is just A, as the chain is a sequence of i.i.d. (0.5) 
random variables. 

There are several approaches to estimate the context tree and transition 
probabilities of VLMCs from a finite realization of X . We mention the con- 
text algorithm proposed by Rissanen (1983); see also Biihlmann and Wyner 
(1999) and Calves et al. (2008). Recently, Csiszar and Talata (2006) pro- 
posed the use of the Bayesian Information Criterion (BIC) and the Minimum 
Description Length Principle (MDL). These algorithms provide consistent 
estimates of the parameters. Our work utilizes the PST algorithm and so 
we provide a brief review here. 

Suppose xi, . . . ,xi is a sample of a VLMC over A specified by the pair 
{t,p) (in our setting xi, ■ ■ ■ ,xi represents a protein over the alphabet of 20 
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222(1-Q,a) 222 



Fig. 1. An example of stationary conditional probability distributions p over the alpha- 
bet A = {1,2} (a) and the corresponding context tree t (b). p and t completely specify a 
stationary VLMC process X = (X„)„gz. We assume < a < 1, a ^ 0.5. Each node in the 
tree (a) is labeled by a sequence over the alphabet A and has an associated probability dis- 
tribution over A (see text for more details). The contexts of the process are the sequences 
in boldface {111,211,122,222} (note that 21 and 12 are not contexts in our definition). 
The context tree in this case is (b), representing the set {A, 1, 2, 11, 22, 111, 211, 122, 222}. 



amino acids). For any sequence a\ G define the counters 

(5) iV(ai) = 2l{xSl = ai}, 

where the function 1 takes value 1 if x^^\ = a\ and otherwise. For any 
sequence a\ € A' such that N{a\) > 1 and any symbol a € A, we define the 
empirical transition probabilities 'p{a\a\) as 

(6) P{a\a{)- ^ 1 ' 



To estimate the context tree associated to the sequence, two parameters 
are fixed: L, the maximal depth of the estimated tree t and r > 1, a threshold 
value. The PST algorithm defines the context tree estimator t as the tree 
containing all the sequences aj, with j < L, A^(a{) > 1, such that there exists 
a symbol a (z A satisfying 

(7) I logp(a|a2) — logj5(a|a{)| > logr. 

That is, the node a\ is a node of t if the conditional probabilities p(-|a{) and 
p{-\a2) are sufficiently far in the sense of (7); this is the empirical version 
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of (2)-(3). To guarantee that t is a tree, include also all suffixes of included 
nodes; that is, o{ S t implies a!^ € t. 

The PST algorithm uses other parameters to smooth the estimated tran- 
sition probabilities given by (6). This smoothing is useful to avoid null esti- 
mated probabilities that can damage the prediction step in the classification 
of new sequences. Since our interest is to estimate only the context tree, it is 
sufficient to consider the parameters L and r. We refer the interested reader 
to Ron, Singer and Tishby (1996), Bejerano (2003) and Bejerano (2004) for 
a full explanation of the PST algorithm, its implementation and some basic 
examples. 

3. The tree distance and the two-sample test. For the two sample prob- 
lem, the BFFS test is based on the supremum (over the space of trees) of 
the difference between the empirical mean distance function of each sample 
to a given tree. More formally, for a node v & V and a tree t in T denote 
t{v) = l{v is a node of t}. Let (j):V ^ M"*" be a nonnegative function and 
consider the distance in T defined by 

(8) d{t,t'):=Y,<t>{vmv)-t'{v)f. 

Let T be a random tree on T with law tt and t = (ti,...,t„) a random 
sample of T (independent random trees with the same law as T). Define the 
empiric expected distance of a tree t to the sample by 

1 " 

(9) d{t,t):=-Y^d{ti,t). 

1=1 

Consider two samples t and t' of random trees T and T', with laws vr and 
vr', with sample sizes n and m respectively. The two-sample problem is to 
test 

(10) Ho:tt = tt\ Ha-tt^tt'. 
BFFS show that, under Hq, the process 

(11) ( /^^( J(t, t) - d{t, t')) , t G t) 
^ \ n + m ' 

converges weakly as min(n, m) ^ oo to a Gaussian process and propose the 
statistic 

(12) W^(t,t') :=sup|J(t,t) -d(t,t')|. 

Under for large n and m, y^ J^ W(t,t^) has approximately the law of 
the supremum over t of a Gaussian process indexed by t € T. Determining 
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the quantiles using the asymptotic law, the null hypothesis is rejected at 
level a when 

I w^(t,t') !><?„. 

The quantiles are obtained using permutation-based randomization tech- 
niques [Manly (2007)]. See Section 6.1 for details. 

About the power of the BFFS test. Strictly speaking, the null hypothesis 
of the BFFS test is 

H'^-Asm of (d(i,r),t eT) =law of {d{t,T'),t £T). 

Of course, rejection of Hq implies rejection of Hq, so that we do not need to 
worry when rejecting. But the test could accept H'q even when Hq is false. 
We give in Appendix A.l examples of different vrs giving rise to the same 
process {d{t, T),t gT) and show some sufficient conditions on vr under which 
the law of {d{t,T),t € T) determines vr. 

4. Graph computation of W{t, t'). To compute the test statistic W{t, t'), 
it is necessary to find the trees attaining the supremum (12). In this sec- 
tion we show that the problem can be reformulated in terms of finding the 
maximal flow on a graph constructed as a function of the sample. 

Denote by t the empiric mean of the sample t: 

(13) tiv):=^J2i^{v). 

1=1 

Since t{v) is not always an integer, t is not necessarily a tree, but t{av) < t{v) 
if V and av are in V. Notice that 

(14) (J(t,t) - d{t,t') = 2C{t) + 4>{v){t{v)-t'{v)), 
where 

(15) m = T.^i^)(t'(^)-ti^M^)- 

Since the last term in (14) does not depend on t, to maximize \d{t, t) — d{t^ t') | 
on T is equivalent to minimize C{t) and —C{t) on T. 
Define the space of configurations 

This set can be identified with the set of subsets of V , so that T <zy. 
In order to penalize configurations in y that are not trees, we define the 
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quadratic function V :y which counts the number of orphan nodes in 
a configuration: 

(16) V{y)= J2 y{av){l-y{v)), y G 3^, 

{v,av}GV 

where for a node v = ai ■ ■ ■ aj, av = aai ■ ■ - aj is a son of u. It is clear that 
7'(y) > and 'P{y) = if and only if ?/ G T. 

The following proposition shows that to maximize \d{t,t) — d{t,t')\ on 
T is equivalent to minimize C{y) + (3V{y) and —C{y) + (3V{y) on 3^ for /3 
sufficiently large. 



Proposition 4.1. Let j3 > Y.v^v ^i'") ■ Then 
(17) 



argm^ |(i(t, t) — d{t,t')\ 



C argmin(£(y) + (3V{y)) U argmin(-£(y) + (3V{y)). 
y&y y&y 

The proof of Proposition 4.1 is given in Appendix A. 2. The proposition re- 
duces the minimization problem on T to the task of minimizing the Hamil- 
tonians C + (iV and —C + fSV on 3^. _ _ 

The Hamiltonian C + (3V is represented by the oriented graph {V,E) 
given by 

(18) V:=VU{s}U{b}, E := {{v,w) :v,w eV}, 

where s (source) and b (sink) are two extra nodes. The graph has the fol- 
lowing capacities associated to the (oriented) edges: 

{{(j){v){t{w) -¥{w)))^, iiv = sandweV, 

{(l)iv){t{v)-?iv)))-, if veV and w = b, 

13, iiveV,w = aveV,aeA, 

0, otherwise, 

where = max{x, 0}, x~ = max{—x, 0}. That is, the edges linking a node 
of V to its sons have capacity f3, the edge linking a node of V to the sink, and 
the edge linking the source to a node of V have capacity (f>{v){t{v) — t'{v)))^ 
according to the sign of {t{v) — t'{v)); the other edges have zero capacity. 
A configuration y defines a cut of the graph 

(20) C{y) := {{v,w)eE: c{v, w) > 0,v e {V \y) U {s}, weyU {b}}, 
whose capacity c{y) is 

(21) c{y):= 

{v,w)eC(y) 

The next result has been proven by Kolmogorov and Zabih (2004), as a 
generalization of an approach of Greig, Porteous and Scheult (1989). We 
give some details of the proof in Appendix A. 2. 
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Proposition 4.2. It holds that c{y) = k + C{y) + l3V{y) for all y ey, 
where k does not depend on y. 

Proposition 4.2 shows that to minimize C{y) + [3V{y) it is sufficient to find 
a minimum cut in its associated graph. This problem can be solved by means 
of the Ford-Fulkerson type of algorithm as proposed by Greig, Porteous and 
Scheult (1989). We use the variant and implementation of Kolmogorov and 
Zabih (2004). 

The idea behind the Ford-Fulkerson algorithm is the following. Suppose 
that liquid is flowing from source to sink in the graph with nodes V and 
pipes E with capacities c(-,-). Take a piece of chalk and draw an arrow 
in the direction of the flow over the pipes with positive capacity that are 
not totally filled. Draw an arrow in the direction opposite to the flow over 
the pipes carrying some liquid. Of course there may be pipes with arrows 
in both directions! Now try to walk from the source to the sink, always 
following your arrows. When you arrive at a dead end, return to the source. 
If you never arrive to the sink, the flow is maximal and the nodes y that 
you have not visited define a cut C{y) with minimal capacity. If you arrive 
to the sink, you can increase the total flow by e by increasing it by e in the 
pipes that you have walked forward from the source, and decreasing it by 
the same amount in the pipes that you walked backward. It turns out that 
the number of operations necessary to find the minimal cut is polynomial in 
the number of nodes. 

5. Generalizations. 

The one-sample test. Given a sample t of a random tree with law vr, the 
one-sample test is 

(22) i?o:vr = 7r', Ha'-t^^t^' 

for a given probability vr' on T . The BFFS statistic in this case is given by 

(23) VF(t) :=sup|J(t,t) -7rd(t)|, 

where 

(24) T:d{t) := Ti{t')d{t' ,t) 

t'eT 

is the expected distance between t and a random tree with law vr. The graph 
computation of W{t) is done exactly as in Section 4, but in the definition 
(15) of C the mean occupation value t'{v) is substituted by 

(25) fx^,{v):=Y,7r'{t)t{v), 

t 

the average occupation number of node v under vr'. 
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More general trees. The BFFS test works for finite or infinite rooted 
trees contained in a full tree V. V must satisfy that the number of children 
per parent is uniformly bounded by m (say). This condition is necessary 
for the Proposition 4.1, which transforms the problem of minimizing the 
difference of the distances on the minimization of a Hamiltonian. The nodes 
in V can be coded with finite sequences of letters of the alphabet A = 
{1, . . . , m} in such a way each node is coded with the sequence coding of his 
father plus a letter of A. In our case the letter is added at the beginning of 
the sequence so that the sequence corresponding to a parent is a suffix of 
those corresponding to its children. Alternatively, the letter can be added 
at the end; in this case the parents sequences are prefixes of the children. 
Any other labeling would work in the same way, as only the structure of the 
tree (and not the labeling of the nodes) is relevant in the construction of the 
test. The structure of the tree is the set of nodes and the set of edges; with 
our coding the set of edges is deduced from the node coding: = {(a{, ag)}; 
otherwise E must be explicitly defined. 

If V is infinite, V is truncated to nodes with at most L symbols and 
calling ttl the law of the truncated tree, the null hypothesis is Hq : ttl = tt^. 

6. Numerical results. 

6.1. Testing protein related populations of trees. In this subsection we 
present some results obtained by applying the two-sample test over protein 
domain families of the Pfam-A database. As mentioned in the Introduction, 
our framework is the following: 

• Each family of protein domains induce a (different, hopefully) proba- 
bility distribution vr on the space of trees T. 

• Given two families J- and J-"', we consider their associated signatures, that 
is, the probability laws vr and vr' on the space T. 

• For each family J^j we take a sample of protein sequences of size uj , and for 
each sequence in the sample we construct the PST context tree estimator, 
as described in Section 2. We obtain a sample of size nj of i.i.d. random 
elements on T with distribution ttj. 

• Finally, for each pair of families Tj,J-j' we test if both distributions nj 
and vTj/ are the same. 

To test the approach, we randomly choose families J^i,...,^io whose 
names start with letter A and such that their average lengths are larger 
than 150 amino acids (this last condition was to ensure some precision in 
the context tree estimation step). In order to guarantee the quality of the 
samples, we only choose sequences in the seed of each family. The chosen 
families are ABC-2membrane, ABC-membrane, Amidase, Amidino-trans, 
AMME-CRl, AOX, ArgK, ASC, Asp-Arg-Hidrox and Asp-Al-Ex. 
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We randomly select rij = 50 sequences from each family and compute 
the associated PST context tree estimator of each sequence using the PST 
algorithm with parameters r = 1.05 [as Bejerano (2001)] and L = 4. In this 
way we obtain a sample of 50 trees per family. 

We consider the distance function = ^ ^jjgpg ^j^e function gen(u) 

is defined as the length of the sequence labeling node v. That is, if node v 
corresponds to sequence ai, then gen(t!) = k. For each 9 G {0.001,0.01,0.35} 
we run the BFFS test for each pair of families using the corresponding 
samples of trees. We also run the tests under the null hypothesis collecting 
two independent samples from the same family. For each pair of samples of 
trees we estimate the (1 — a)-quantile under the null hypothesis using Monte 
Carlo randomization [Manly (2007)], that is, we permute the pooled sample 
a thousand times and compute the test statistic for each of the replicates us- 
ing half of the permuted pooled sample for each population. The estimated 
quantile is therefore the empirical (1 — a)-quantile for the vector of size 1000 
built up in this way. 

All 45 tests are rejected at level a = 0.001 for the three values of 9. Despite 
the conservative level used in the tests, the hypothesis of equal distribution 
is rejected in all cases when the samples came from different Pfam-A fami- 
lies, confirming the discriminative power of the context trees associated to 
the sequences. In the case of the same family, but with independent samples 
of trees, for 9 = 0.35 we observe p- values ranging from 0.15 to 0.87, com- 
patible with the uniform distribution (the law of the p- values under the null 
hypothesis). Similar results were obtained with the other two values of 9. 

6.2. Simulation results. We also challenge our method in a small Monte 
Carlo simulation for Galton-Watson processes, for three different models, 
parameters and sample sizes. The results are reported in Appendix A. 3. 

7. Final remarks. We perform the BFFS method to test if two samples 
of context trees come from different distributions, and we propose a feasible 
way to compute its statistic, allowing the treatment of reasonably big trees. 
The test rejects the null hypothesis in the case of high quality, manually 
curated Pfam families, and it does not reject on random subsets of the same 
family. This supports the use of the test as a method to distinguish different 
groups of protein domains when a specific task, as, for example, sequence 
annotation, does not give conclusive results. 

Our results strongly indicate that the context trees associated to protein 
domain sequences are sufficient to discriminate between different families in 
the Pfam database. In this sense we have benefited from ideas coming from 
the analysis of sequences related to linguistics. Calves and collaborators have 
proposed with success the use of context trees to discriminate languages 
from codified written text [Calves et al. (2004)]. More recently, they have 
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used similar ideas in a preliminary work to study the phylogeny of protein 
sequences [Leonardi et al. (2007)]. 

We emphasize that the test is not restricted to the analysis of samples of 
context trees. Any space of trees satisfying the assumptions of Section 5 will 
be suitable for using our approach. On the other hand, for particular dis- 
tributions like Galton-Watson processes, more simple tailor-made tests can 
be developed. Our simulations show that the BFFS test is able to distin- 
guish between distributions determined by the node-marginal distributions, 
which is a large family of distributions for applications. This class includes 
tree laws with a Markovian hypothesis, as shown in Proposition A. 2. We 
discuss this item in detail in Appendix A.l. 

8. Computing notes. The code to compute the test statistic is available 
from Jorge R. Busch (jbusch@fi.uba.ar) upon request. Calculation reported 
here used Scilab INRIA (http://www.scilab.org/) and C++ code from Be- 
jerano (2004) and Kolmogorov and Zabih (2004). 

The computational burden for our algorithm allows us to work with trees 
with up to 3^^ nodes. Each p-value involves 1001 test statistic calculations, 
with a sample size n = 50 for each family, taking at most 15 minutes to be 
complete, with a Pentium Core 2 duo with 2Gb of RAM memory. 

APPENDIX 
A.l. Mean distances and Markovian hypotheses. 

Do the mean distances determine a measure? Recall the mean distance 
nd is defined in (24). Our test is universally consistent within the class of 
distributions for which {'jrd{t),t G T) determine the probability vr. That is, 
whatever is the law of the families, the test will asymptotically detect the dif- 
ference. We show in Lemma A.l that ird determines the law of the marginals 
(T{v),v € V). Proposition A. 2 says that, under Markov type hypotheses, the 
marginal distributions determine the measure. 

For a random tree T with law vr recall ^-wiv) is the mean occupancy node 
V defined in (25) and define o"^(f), the variance of T{v), by 

(A.l) al{v)=fi^{v){l-fi^{v)). 

Lemma A.l. Let vr and vr' be measures on T. Then, vrd = vt'c? if and 
only if fi^ = fi^, . 

The proof is given later in this section. 

Different measures may have the same mean distances. For instance, con- 
sider ^ = {1,2} and vr, vr' defined by 

7r(0) = 1; 7r({A}) = 7r({A, 1}) = ^({A, 2}) = 7r({A, 1,2}) = i, 
7r'(0) = l; 7r'({A}) = ^'({A,l,2}) = i. 
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Then, /ijrl'f) = fJ-^'iv) for all v £V. 

Lemma A.l implies that in general the functions nd and ir'd do not help 
to solve the discrimination problem. But in some cases these functions do 
discriminate. To show that, we need some extra notation. For a set / of 
nodes denote Tj the restriction of T to / and Tj = 1 means that T{v) = 1 
for all V (z I, while T/ = means that T{v) = for all v £ I. 

Let f be a node, and a,b & A. We shall call v father of av, av son of v, 
and av brother of bv. Let f -V \ {A} ^ y be a function such that, for each 
V ^ X, f{v) is father or brother of v, and f^{v) = A for some n = n{v) G N. 
Notice that, in this case, f~^{v) is empty or formed up by brothers and sons 
of V. We call such a function a tree-shift. Consider, for instance, the function 
/ that assigns to a node its father. 

Let T be a random tree with law vr. We say that T satisfies a Markov 
hypothesis if there exists a tree-shift / such that if v = f{w), then 

(a) < iJ,n{w) < iJ,n{v) < 1, 

(b) ¥{T{w) = l\T{v) = 0) = 0, 

(c) let /, J C y be such that if t; G /, -u; ^ /U J and P(T/ = 1, Tj = 0) > 0, 
then 

(A.2) F{T{w) = l\Ti = l,Tj = 0)=¥{T{w) = l\T{v) = 1). 

Proposition A. 2. Under the Markov hypotheses, the marginals {fin{v),v G 
V) determine the probabilities (7r(t),tGT). 

The proof is given later in this section. 

Examples of measures satisfying the Markov hypotheses. The alphabet 
for the following examples is A = {1, . . . , m}. 

1. Let / be defined by f{av) = v, a G A. Let k{v) be such that 
jfc(i))-i^^-j _ £qj, V eV. li Ht,{v) =p'=('^) (0 < p < 1), we obtain a tree with 
7r({A}) = p, and when T{v) = 1, T[av) is Bernoulli with parameter p, for 
a G A. We call such tree distributions pseudo Galton-Watson processes. 

2. Let / be defined by /(If) = v, and /((a + l)v) = av for 1 < a < m. 
That is, v = f{w) if w is the eldest brother and v is the father of or if v 
is the nearest older brother of v. Let now PO) • • • iPm be given probabilities, 
with po > and po + • • ■ + Pm, = 1 . If 

lJL.„{lv) = (pi H hPm)/i7r(l'), 

/J-niicL + 1)^^) = ^""'"^ ^-^^LLTr(av) (1 < a < m — 1), 

Pa + ---+Pm y - - 

we obtain the classical Galton-Watson process, with parameter probabilities 
Po,...,Pm- 
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If vr is a distribution on T, and T is a random tree with distribution vr, 
then by a simple computation, 

(A.3) TTdit) = J2 - t{v)f + E <l^{v)al{v) 

(A.4) = (t>{v)l^.{v){l - 2t{v)) + mt{v). 

Proof of Lemma A.l. Notice first that from (A.4) it follows that 
(A.5) 7rd{t) - 7r'd{t) = ^ cP{v)Mv) " f^.'iv)){l - 2tiv)), 

which implies that if fJ,n{v) = Hn'iv) for all v (zV, then Trd{t) = Tr'd{t). This 
proves sufficiency. 

To prove necessity, we proceed by induction. When ird^t) = 7r'd(t) for all 
t (^T, from (A.5) we obtain 

= E Hy)Mv) - f^n'{v)){i - 2t{v)) 

(A.6) 

vet v0 
for all t ^T. Letting t = 0, the empty tree, and t = {A} in (A.6), we obtain 

(A.7) 0= X</.(7;)(/i,(?;) -//,/(?;)), 

(A.8) = -4>{\){^iA\) - i^^'ix)) + Y Hv)Mv) - i^n'iv)). 

Substracting (A.7) and (A.8) and using that (p{v) > 0, we get fi-jrW = IJ-w'W- 

Inductive step. Let t ^T, and h ^V\ {t} such that t U {h} G T. We show 
that if /i7r(f) = Htt'{v) for all G t, then ^7r(/i) = Hn'ih). First, we obtain from 
(A.6) 

(A.9) = YHv)Mv)-l^n'iv)), 

(A.IO) = -(t>{h){fi^{h)-fi^,{h))+ Y (t>{y)My)-f^.'{v)), 

v0U{h} 

and it follows that ^7r(/j) = /i,r'(^)- ^ 
It is easy to prove the following lemma 
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Lemma A. 3. If T is a random tree satisfying the Markov hypotheses 
with tree-shift f , then, given T{y) = 1, the variables T{w) :w € f~^{v) are 
independent. Furthermore, ifv = f[w), 

(A.ll) p(TH = l|r(7;) = l) = ^^4^. 

Proof of Proposition A. 2. First, notice that 
(A.12) ^(0) = 1-^,(A). 

From Lemma A. 3, 

(A.13) vr({A}) = ^.(A) [] " ^ 

Let t €T \ {0, V} and /i be a node such that h ^ t and v = f{h) G t. We 
shall show that 

(A.14) ^(tU{/i}) =7r(t)- ^"^^^ 



First, we have 

7r(t U {h}) = = 1, T{h) = 1, r(,u{/,})c = 0) 
(A.15) = P(r(/i) = llTt = l,T(tu{M)= = 0)^(^* = l'7^(iu{M)^ = 0) 

= P(T(/i) = l\T{v) = l)P(Tj = 1, T(,u{M)= = 0). 

On the other hand, 

Ti{t) = = 1, T{h) = 0, T^tyj{h}Y = 0) 
(A.16) = ¥{T{h) = ^\T{v) = l)¥{Tt = 1, T(t^{h}Y = 0) 

= (1 - P(r(/i) = 1|T(^;) = l))P(Ti = 1, T(,u{M)c = 0). 

From (A.15) and (A.16) it follows that 

nT(h) = l\T{f{h)) = l) 



(A.17) 7r(tU{/i}) =7r(i) 



l-¥{T{h) = l\T{f{h)) = l) 



This shows (A.14). Our main statement follows now by induction from 
(A.13) and (A.14), noticing that any finite tree may be constructed from 
{A} in this way. □ 
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A. 2. Redefining the minimization problem. In this subsection we prove 
Propositions 4.1 and 4.2. Call A{v) = {t{v) -t'{v)) so that 

(A.18) Cit) = Y.<l>{v)Aiv)tiv). 

Recall the space of configurations 3^ = {0,1}^. Trees minimizing C will 
also minimize C + [iV for all positive (3: If t G T n argmin^gj; then 
t G argmiuyg^; (£(?/) + (3V{y)) for all /? > 0. On the other hand, if (3 is big 
enough, we expect the configurations minimizing C + (3V to be trees. Since 
on the set of trees the form V vanishes, the minimizing trees should also 
minimize C. This is proven in the following lemma. 

Lemma A. 4. // /? > X^t-gy . ^/^en 
(A. 19) argmin£(t) = argmin(£(y) + j3'P{y)). 

Proof. Observe that minimizing configurations in 3^ satisfy 

(A.20) y'Gargmin£(y) if and only if y'(t;) = | ^' jf < 

Let £min and £max be the values of the minimum and maximum of £ over 
3^, that is, 

(A.21) Cmin= E Hv)M^)^ ^m..= J2 

v:A{v)<0 v:A(v)>0 

Notice that 

(A.22) £„,ax - C^in = E '^(") 1^(^)1 < E <^(^)- 

v&V v&V 

For all y G 3^ it holds 
(A.23) Ciy) + pV{y)>C^,^ + P ^ ^ y(at;)(l - 

v:y{v)=0 a£A 

If y is not a tree, there exists v £ V and a £ A such that y{v) =0 and 
y{av) = 1, hence, 

(A.24) C{y) + f3V{y) > C^^in + P> C^in + E '^(^) ^ ^"^a- 

by (A.22). On the other hand, C{t) + (3V{t) = C{t) < £max for any tree 
t. Hence, for these values of /?, if y is not a tree, then C{y) + (3V{y) > 
maxfg7-(>C(t) + PV{t)) and the result follows. □ 

Proof of Proposition 4.1. It follows from (14), (15) and the above 
lemma applied to C and —C □ 
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Proof of Proposition 4.2. The proof follows from a simple algebra. 
More generally, the Hamiltonian TC-.y given by 

(A.25) niy) = J2H%yiv))+ ^ H^^^ iyiv),yH) 

v^V v,w£V 

is said regular if the quadratic terms satisfy 

(A.26) H''''"{0,0) + H''''"{1,1) < H''''"{0,1) + H''''"{1,0). 

Theorem 4.1 of Kolmogorov and Zabih (2004) says that regular Hamiltonians 
are graph representable, meaning that it is possible to associate capacities 
to the graph (V,E) defined in (18) in such a way that 

(A.27) C{y) = k + n{y), y^y, 

where C{y) is the capacity of the cut defined by y [see (21)] and /c is a 
constant. 

The Hamiltonian C + [3V is regular because H^'""^ {t{v),t{av)) = (3t{av){l — 
t{v)) satisfy (A.26). The graph (18) with capacities (19) is the Kolmogorov 
and Zabih representation of the Hamiltonian C + f3V. □ 

A. 3. Simulation results. In this subsection we perform the test for the 
two-sample problem (10) using samples t and t' with distribution vr and vr' 
on T, trees with maximal depth L = 8. We compute the BFFS statistic W 
given in (12) using the approach of Section 4. We use the distance (8) with 
(j){v) = 6is™(^) for various values of 6. Since in this case we know vr and vr', 
we estimate the quantiles directly by Monte Carlo simulation, as follows: 

Quantile 

1. Generate two samples of size n both from ^vr + ^vr', a fair mixture of the 
laws. Label them sample 1 and sample 2. Compute the test statistic W 
using the samples. 

2. Repeat the above procedure a fixed number of times N . 

3. Order the computed statistics values increasingly and define the quantile 
q{l — a) as the statistic in place (1 — a)N . 

4. Calculate the quantiles for several values of a and sample sizes n. 

Power 

1. Generate sample 1 from vr, sample 2 from vr' and compute W using them. 

2. Compare the obtained value against the quantile, and reject the null 
hypothesis with level a MW > q{\ — a). 

3. Repeat the last two steps a fixed number of times and compute the per- 
centage of rejections for each value of a as a measure of the power of the 
test. 



TESTING STATISTICAL HYPOTHESIS ON RANDOM TREES 



19 



Table 1 



Model 1. Powe 


r of the 


tests with 


p = 0.5 


and p' — 
125 


0.6, 0.7, O.f 


■i, sample 


size n = 


31, 51, 




n = 31 






n = 51 




n = 125 




a/p' 0.6 


0.7 


0.8 


0.6 


0.7 


0.8 


0.6 


0.7 


0.8 


0.01 0.101 
0.05 0.242 
0.1 0.325 


0.504 
0.708 
0.806 


0.923 
0.986 
0.998 


0.122 
0.313 
0.426 


0.744 
0.889 
0.935 


0.999 

0.9999 

0.9999 


0.466 
0.661 
0.743 


0.994 
0.999 
0.9999 


0.9999 
0.9999 
0.9999 



Model 1: Binomial. Let vr be the law of a Galton- Watson process 
with offspring distribution Binomial (2, and vr' is the same with parame- 
ter p'. We use p = 0.5 and p' = 0.6, 0.7, 0.8. Table 1 shows the percentage 
of rejection over 1000 tests of level a = 0.10, 0.05, 0.01 for sample sizes 
n = 31, 51, 125. The results show the consistency of the BFFS test for alter- 
natives with any value of p 7^ 0.5. For this simple model, small sample sizes 
are enough to get high power. 

Model 2: Mixture of Binomials. Let vr be the law of a Galton- 
Watson process with offspring distribution a mixture of Binomials. Inde- 
pendently at each node with probability q use a Binomial(2,pi), otherwise 
a Binomial(2,p2)- For vr' we use q' , p'l and But we take q = q' = 0.5 in 
the examples. 

Model 2.1. Take pi = 0.45, p2 = 0.5 and p[ = 0.1, P2 = 0.85. Figure 2 
shows histograms of the test statistic values obtained in 1000 iterations, 
for sample sizes 31 (left) and 131 (right) respectively. We have plotted the 
test statistic under null hypothesis on red and the one under the alter- 
native on blue. The distance parameter was 9 = 0.35. The expected mean 
at each node is the same because pi + P2 = v'\ + v'2 ! the distributions un- 
der the null and alternative hypothesis are close to each other but the 
BFFS test needs only a moderate sample size to give high power to the 
test. In Table 2 we evaluate the power of the test for three different val- 
ues of the test level (a, as 0.01, 0.05 and 0.1), in the same way as in 
Table 1. The left plot on Figure 3 shows 5 curves of 1000 p- values each, 
for sample sizes = 31, 51, 71, 101, 131, and the distance parameter Q = 
0.35. The right one shows the same results when the distance parameter is 
changed to = 0.49. Increasing the parameter Q decreases the power of the 
test. 

Model 2.2. The parameters p\ = 0.3, p2 = 0.65, Pi = 0.45 and p'2 = 0.5 
give a different scenario, since the distributions of the populations are very 
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Fig. 2. Model 2.1. Histogram of the test statistic under null hypothesis (black plot) and 
alternative (white plot), with sample sizes 31 and 131. 
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Fig. 3. Model 2.1. p-values calculated 1000 times; sample sizes 31, 51, 11 and 131. 
Left plot: parameter = 0.35. Right plot: parameter 9 = 0.49. Increasing the parameter 9 
decreases the power of the test. 
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Fig. 4. Model 2.2. Histogram of the test statistic under null hypothesis (black plot) and 
alternative (white plot); left plot size — 50 and right plot size — 500. Parameter 9 — 0.35. 




100 200 300 400 500 600 700 600 900 1000 50 100 150 200 250 300 350 400 450 500 



Fig. 5. Model 2.2. Left plot: p-values computed 1000 times, each curve related to a dif- 
ferent sample size 50, 250 and 500. Right plot: percentage of rejection as a function of 
the sample size, each curve computed with a different a level, 0.01 in dotted line, 0.05 in 
dashed line and 0.1 in interpolated line. Parameter 9 = 0.35. 



22 



J. R. BUSCH ET AL. 



Table 2 

Model 2.1. Power of the tests of level a for 1000 replications, with a = 0.05, 0.01, 0.1. 
Sample sizes are 31, 51, 71, 101, 131. Parameter S = 0.35 



a 


31 


51 


71 


101 


131 


0.01 


0.045 


0.108 


0.325 


0.594 


0.819 


0.05 


0.288 


0.363 


0.747 


0.887 


0.973 


0.1 


0.438 


0.642 


0.857 


0.976 


0.990 



close. Figure 4 shows slow changes in the empirical distribution as the sample 
size grows. At the left is the histogram of the 1000 values of the test statistic 
under the null (red plot) and alternative hypothesis (blue plot) for sample 
size = 50. For the right histogram the sample size is = 500. In Figure 5 
we consider larger sample sizes for Model 2.2. We fix ^ = 0.35 and plot the p- 
values computed over 1000 replications for sample sizes A^ = 50, 250 and 500 
(left plot) and the percentage of rejection as a function of the sample size, 
each curve computed with a different a level. Besides sample fluctuations, 
the percentage of rejection increases with sample size, but at a quite slow 
rate. 
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